accuracy rate
POLIS-Bench: Towards Multi-Dimensional Evaluation of LLMs for Bilingual Policy Tasks in Governmental Scenarios
Yang, Tingyue, Yao, Junchi, Guo, Yuhui, Liu, Chang
We introduce POLIS-Bench, the first rigorous, systematic evaluation suite designed for LLMs operating in governmental bilingual policy scenarios. Compared to existing benchmarks, POLIS-Bench introduces three major advancements. (i) Up-to-date Bilingual Corpus: We construct an extensive, up-to-date policy corpus that significantly scales the effective assessment sample size, ensuring relevance to current governance practice. (ii) Scenario-Grounded Task Design: We distill three specialized, scenario-grounded tasks -- Clause Retrieval & Interpretation, Solution Generation, and the Compliance Judgmen--to comprehensively probe model understanding and application. (iii) Dual-Metric Evaluation Framework: We establish a novel dual-metric evaluation framework combining semantic similarity with accuracy rate to precisely measure both content alignment and task requirement adherence. A large-scale evaluation of over 10 state-of-the-art LLMs on POLIS-Bench reveals a clear performance hierarchy where reasoning models maintain superior cross-task stability and accuracy, highlighting the difficulty of compliance tasks. Furthermore, leveraging our benchmark, we successfully fine-tune a lightweight open-source model. The resulting POLIS series models achieves parity with, or surpasses, strong proprietary baselines on multiple policy subtasks at a significantly reduced cost, providing a cost-effective and compliant path for robust real-world governmental deployment.
- Law (1.00)
- Government (1.00)
- Information Technology > Security & Privacy (0.46)
TLCD: A Deep Transfer Learning Framework for Cross-Disciplinary Cognitive Diagnosis
Wang, Zhifeng, Su, Meixin, Yang, Yang, Zeng, Chunyan, Ye, Lizhi
Driven by the dual principles of smart education and artificial intelligence technology, the online education model has rapidly emerged as an important component of the education industry. Cognitive diagnostic technology can utilize students' learning data and feedback information in educational evaluation to accurately assess their ability level at the knowledge level. However, while massive amounts of information provide abundant data resources, they also bring about complexity in feature extraction and scarcity of disciplinary data. In cross-disciplinary fields, traditional cognitive diagnostic methods still face many challenges. Given the differences in knowledge systems, cognitive structures, and data characteristics between different disciplines, this paper conducts in-depth research on neural network cognitive diagnosis and knowledge association neural network cognitive diagnosis, and proposes an innovative cross-disciplinary cognitive diagnosis method (TLCD). This method combines deep learning techniques and transfer learning strategies to enhance the performance of the model in the target discipline by utilizing the common features of the main discipline. The experimental results show that the cross-disciplinary cognitive diagnosis model based on deep learning performs better than the basic model in cross-disciplinary cognitive diagnosis tasks, and can more accurately evaluate students' learning situation.
- Asia > China > Hubei Province > Wuhan (0.05)
- Asia > Macao (0.04)
- Europe > Greece (0.04)
- (3 more...)
- Instructional Material (0.68)
- Research Report > New Finding (0.34)
- Education > Educational Setting > Online (1.00)
- Education > Curriculum > Subject-Specific Education (0.68)
- Education > Educational Technology > Educational Software > Computer Based Training (0.66)
Toward Errorless Training ImageNet-1k
In this paper, we describe a feedforward artificial neural network trained on the ImageNet 2012 contest dataset [7] with the new method of [5] to an accuracy rate of 98.3% with a 99.69 Top-1 rate, and an average of 285.9 labels that are perfectly classified over the 10 batch partitions of the dataset. The best performing model uses 322,430,160 parameters, with 4 decimal places precision. We conjecture that the reason our model does not achieve a 100% accuracy rate is due to a double-labeling problem, by which there are duplicate images in the dataset with different labels.
- North America > United States > Nebraska > Lancaster County > Lincoln (0.14)
- North America > United States > Colorado > El Paso County > Colorado Springs (0.04)
Triple-Stream Deep Feature Selection with Metaheuristic Optimization and Machine Learning for Multi-Stage Hypertensive Retinopathy Diagnosis
Suyun, Suleyman Burcin, Yurdakul, Mustafa, Tasdemir, Sakir, Bilic, Serkan
Hypertensive retinopathy (HR) is a severe eye disease that may cause permanent vision loss if not diagnosed early. Traditional diagnostic methods are time-consuming and subjective, highlighting the need for an automated, reliable system. Existing studies often use a single Deep Learning (DL) model, struggling to distinguish HR stages. This study introduces a three-stage approach to enhance HR diagnosis accuracy. Initially, 14 CNN models were tested, identifying DenseNet169, MobileNet, and ResNet152 as the most effective. DenseNet169 achieved 87.73% accuracy, 87.75% precision, 87.73% recall, 87.67% F1-score, and 0.8359 Cohen's Kappa. MobileNet followed with 86.40% accuracy, 86.60% precision, 86.40% recall, 86.31% F1-score, and 0.8180 Cohen's Kappa. ResNet152 ranked third with 85.87% accuracy, 86.01% precision, 85.87% recall, 85.83% F1-score, and 0.8188 Cohen's Kappa. In the second stage, deep features from these models were fused and classified using Machine Learning (ML) algorithms (SVM, RF, XGBoost). SVM (sigmoid kernel) performed best with 92.00% accuracy, 91.93% precision, 92.00% recall, 91.91% F1-score, and 0.8930 Cohen's Kappa. The third stage applied meta-heuristic optimization (GA, ABC, PSO, HHO) for feature selection. HHO yielded 94.66% accuracy, precision, and recall, 94.64% F1-score, and 0.9286 Cohen's Kappa. The proposed approach surpassed single CNN models and previous studies in HR diagnosis accuracy and generalization.
- Asia > Middle East > Republic of Türkiye > İzmir Province > İzmir (0.04)
- Asia > Middle East > Republic of Türkiye > Konya Province > Konya (0.04)
- Asia > Middle East > Iran (0.04)
- Africa > Middle East > Egypt > Cairo Governorate > Cairo (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
A Comparative Study of Machine Learning Algorithms for Stock Price Prediction Using Insider Trading Data
Chakravorty, Amitabh, Elsayed, Nelly
The research paper empirically investigates several machine learning algorithms to forecast stock prices depending on insider trading information. Insider trading offers special insights into market sentiment, pointing to upcoming changes in stock prices. This study examines the effectiveness of algorithms like decision trees, random forests, support vector machines (SVM) with different kernels, and K-Means Clustering using a dataset of Tesla stock transactions. Examining past data from April 2020 to March 2023, this study focuses on how well these algorithms identify trends and forecast stock price fluctuations. The paper uses Recursive Feature Elimination (RFE) and feature importance analysis to optimize the feature set and, hence, increase prediction accuracy. While it requires substantially greater processing time than other models, SVM with the Radial Basis Function (RBF) kernel displays the best accuracy. This paper highlights the trade-offs between accuracy and efficiency in machine learning models and proposes the possibility of pooling multiple data sources to raise prediction performance. The results of this paper aim to help financial analysts and investors in choosing strong algorithms to optimize investment strategies.
- North America > United States > Ohio > Hamilton County > Cincinnati (0.04)
- North America > United States > California > Santa Clara County > San Jose (0.04)
- North America > Mexico (0.04)
- Asia > India > Maharashtra > Mumbai (0.04)
- Banking & Finance > Trading (1.00)
- Transportation > Ground > Road (0.34)
Robust Knowledge Distillation in Federated Learning: Counteracting Backdoor Attacks
Alharbi, Ebtisaam, Marcolino, Leandro Soriano, Ni, Qiang, Gouglidis, Antonios
Federated Learning (FL) enables collaborative model training across multiple devices while preserving data privacy. However, it remains susceptible to backdoor attacks, where malicious participants can compromise the global model. Existing defence methods are limited by strict assumptions on data heterogeneity (Non-Independent and Identically Distributed data) and the proportion of malicious clients, reducing their practicality and effectiveness. To overcome these limitations, we propose Robust Knowledge Distillation (RKD), a novel defence mechanism that enhances model integrity without relying on restrictive assumptions. RKD integrates clustering and model selection techniques to identify and filter out malicious updates, forming a reliable ensemble of models. It then employs knowledge distillation to transfer the collective insights from this ensemble to a global model. Extensive evaluations demonstrate that RKD effectively mitigates backdoor threats while maintaining high model performance, outperforming current state-of-the-art defence methods across various scenarios.
- Europe > United Kingdom (0.14)
- Asia > Middle East > Saudi Arabia (0.04)
Can AI Help with Your Personal Finances?
Hean, Oudom, Saha, Utsha, Saha, Binita
In recent years, Large Language Models (LLMs) have emerged as a transformative development in artificial intelligence (AI), drawing significant attention from industry and academia. Trained on vast datasets, these sophisticated AI systems exhibit impressive natural language processing and content generation capabilities. This paper explores the potential of LLMs to address key challenges in personal finance, focusing on the United States. We evaluate several leading LLMs, including OpenAI's ChatGPT, Google's Gemini, Anthropic's Claude, and Meta's Llama, to assess their effectiveness in providing accurate financial advice on topics such as mortgages, taxes, loans, and investments. Our findings show that while these models achieve an average accuracy rate of approximately 70%, they also display notable limitations in certain areas. Specifically, LLMs struggle to provide accurate responses for complex financial queries, with performance varying significantly across different topics. Despite these limitations, the analysis reveals notable improvements in newer versions of these models, highlighting their growing utility for individuals and financial advisors. As these AI systems continue to evolve, their potential for advancing AI-driven applications in personal finance becomes increasingly promising.
- Law (1.00)
- Banking & Finance > Financial Services (1.00)
- Information Technology > Security & Privacy (0.93)
- (2 more...)
A Novel Task-Driven Method with Evolvable Interactive Agents Using Event Trees for Enhanced Emergency Decision Support
Xiao, Xingyu, Chen, Peng, Qi, Ben, Liang, Jingang, Tong, Jiejuan, Wang, Haitao
As climate change and other global challenges increase the likelihood of unforeseen emergencies, the limitations of human-driven strategies in critical situations become more pronounced. Inadequate pre-established emergency plans can lead operators to become overwhelmed during complex systems malfunctions. This study addresses the urgent need for agile decision-making in response to various unforeseen incidents through a novel approach, EvoTaskTree (a task-driven method with evolvable interactive agents using event trees for emergency decision support). This advanced approach integrates two types of agents powered by large language models (LLMs): task executors, responsible for executing critical procedures, and task validators, ensuring the efficacy of those actions. By leveraging insights from event tree analysis, our framework encompasses three crucial tasks: initiating event subevent analysis, event tree header event analysis, and decision recommendations. The agents learn from both successful and unsuccessful responses from these tasks. Finally, we use nuclear power plants as a demonstration of a safety-critical system. Our findings indicate that the designed agents are not only effective but also outperform existing approaches, achieving an impressive accuracy rate of up to 100 % in processing previously unencoun32 tered incident scenarios. This paper demonstrates that EvoTaskTree significantly enhances the rapid formulation of emergency decision-making.
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > New York (0.04)
- Europe > Spain (0.04)
- Information Technology > Decision Support Systems (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.75)
A Machine Learning Approach for Emergency Detection in Medical Scenarios Using Large Language Models
Akaybicen, Ferit, Cummings, Aaron, Iwuagwu, Lota, Zhang, Xinyue, Adewuyi, Modupe
The rapid identification of medical emergencies through digital communication channels remains a critical challenge in modern healthcare delivery, particularly with the increasing prevalence of telemedicine. This paper presents a novel approach leveraging large language models (LLMs) and prompt engineering techniques for automated emergency detection in medical communications. We developed and evaluated a comprehensive system using multiple LLaMA model variants (1B, 3B, and 7B parameters) to classify medical scenarios as emergency or non-emergency situations. Our methodology incorporated both system prompts and in-prompt training approaches, evaluated across different hardware configurations. The results demonstrate exceptional performance, with the LLaMA 2 (7B) model achieving 99.7% accuracy and the LLaMA 3.2 (3B) model reaching 99.6% accuracy with optimal prompt engineering. Through systematic testing of training examples within the prompts, we identified that including 10 example scenarios in the model prompts yielded optimal classification performance. Processing speeds varied significantly between platforms, ranging from 0.05 to 2.2 seconds per request. The system showed particular strength in minimizing high-risk false negatives in emergency scenarios, which is crucial for patient safety. The code implementation and evaluation framework are publicly available on GitHub, facilitating further research and development in this crucial area of healthcare technology.
- Information Technology > Security & Privacy (1.00)
- Health & Medicine > Health Care Technology > Telehealth (0.90)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.49)
Random Sampling for Diffusion-based Adversarial Purification
Zhang, Jiancheng, Dong, Peiran, Chen, Yongyong, Zhao, Yin-Ping, Guo, Song
Denoising Diffusion Probabilistic Models (DDPMs) have gained great attention in adversarial purification. Current diffusion-based works focus on designing effective condition-guided mechanisms while ignoring a fundamental problem, i.e., the original DDPM sampling is intended for stable generation, which may not be the optimal solution for adversarial purification. Inspired by the stability of the Denoising Diffusion Implicit Model (DDIM), we propose an opposite sampling scheme called random sampling. In brief, random sampling will sample from a random noisy space during each diffusion process, while DDPM and DDIM sampling will continuously sample from the adjacent or original noisy space. Thus, random sampling obtains more randomness and achieves stronger robustness against adversarial attacks. Correspondingly, we also introduce a novel mediator conditional guidance to guarantee the consistency of the prediction under the purified image and clean image input. To expand awareness of guided diffusion purification, we conduct a detailed evaluation with different sampling methods and our random sampling achieves an impressive improvement in multiple settings. Leveraging mediator-guided random sampling, we also establish a baseline method named DiffAP, which significantly outperforms state-of-the-art (SOTA) approaches in performance and defensive stability. Remarkably, under strong attack, our DiffAP even achieves a more than 20% robustness advantage with 10$\times$ sampling acceleration.
- Asia > China > Hong Kong (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > China > Heilongjiang Province > Harbin (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)